STAT 240: Intro to Linear Regression
Overview
Learning Outcomes
- These lectures will teach you how to:
- Assess the strength of a linear relationship visually and with the correlation coefficient
- Understand the linear regression model and evaluate its assumptions
The inference topics below will be covered in a “part 2” file:
- Compute a confidence interval for a linear regression coefficient
- Carry out a hypothesis test for a linear regression coefficient
- Conceptualize and construct prediction and confidence intervals around a regression line
Preliminaries
- Download the file week09-regression.Rmd into STAT240/lecture/week09-regression.
- Download the file ggprob.R into STAT240/scripts if you have not already.
- Download the file lake-monona-winters-2024.csv into STAT240/data if you have not already.
- Download the file riley.txt into STAT240/data.
Motivation: The Relationship between Two Continuous Variables
An extremely common real-life problem concerns how changing one variable affects another.
Whether your predictors and outcomes are discrete or continuous changes how you approach this problem.
- When your outcome is discrete, we have to use more complex methods, such as logistic regression, that are beyond the scope of this introductory class. We will therefore only consider continuous outcomes here, which have more straightforward inference techniques.
This unit on linear regression will consider a continuous predictor and its effect on a continuous outcome.
After that, in our inference on means unit, we will learn how to assess the relationship between the simplest categorical predictor, a binary variable, and a continuous outcome.
Either way, we can restructure the real-life problem as a question about the parameter of a probability distribution.
Data: Lake Monona Winter Freezes
- As a case study to explore linear regression with, we return to the Lake Monona freeze data from the ggplot unit.
- We will try to assess the relationship between year1 (the year in which the winter freeze begins) and duration (the number of days Lake Monona was closed due to ice coverage).
A Visual Assessment
Overlaying a geom_smooth, which is constrained to be straight with the argument method = "lm", shows that the data has a general downward trend.
In fact, the slope of the line produced by geom_smooth(method = "lm") is the main object of interest in this unit.
However, just because the data has a general downward trend does NOT mean each year’s closure gets shorter than the last; there is plenty of point-to-point variation around the main trend.
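Here is a sketch of that plot, assuming monona is the dataframe read in from lake-monona-winters-2024.csv:

ggplot(monona, aes(x = year1, y = duration)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) # method = "lm" constrains the smooth to a straight line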
How do we quantify how strong the (linear) relationship is between two continuous variables? The answer is with the correlation coefficient, \(r\).
- To answer this question with statistical inference, we must first learn about the concept of correlation and the correlation coefficient.
Introduction to Correlation
The Correlation Coefficient, \(r\)
The correlation coefficient, referred to as \(r\), is a measure of the strength of a linear relationship between two continuous variables.
It does not constitute statistical inference on its own - it is just a descriptive statistic - but it is useful as a summary in its own right, and it will eventually show up in inference formulas.
You are not expected to memorize or use this formula; you will use the R command cor. But it is worth analyzing briefly.
\[ r = \mathsf{Corr}(x,y) = \frac{1}{n-1}\sum_{i=1}^n \left(\frac{x_i - \bar{x}}{s_x} \right) \left(\frac{y_i - \bar{y}}{s_y} \right) \]
- Important properties of this formula:
\(\mathsf{Corr}(x,y) = \mathsf{Corr}(y,x)\), because the terms can be multiplied in either order.
\(r\) is unitless; so changing the scale of \(x\) or \(y\) (e.g. multiplying by a positive constant or adding a constant, changing units such as feet to inches) does not change the correlation.
\(r\) is always in the range [-1, 1], where -1 represents a set of points which lie perfectly on a line with a negative slope, 0 represents a set of points with no evidence at all for a non-horizontal relationship, and 1 represents a set of points which lie perfectly on a line with a positive slope.
\(r\) only assesses the strength of a LINEAR relationship, not any other type of curved or piecewise relationship.
- In the lakes plot above, the lake duration tends to decrease as year increases, so we expect a relatively strong negative correlation coefficient, but not quite perfectly -1.
Calculation with cor(x,y)
The built-in R function cor() calculates the correlation coefficient between two vectors.
We need to extract these vectors out of the existing dataframe, either with the $ operator or with the pull() command.
x = monona %>% pull(year1) # equivalent to monona$year1
y = monona %>% pull(duration) # equivalent to monona$duration
cor(x,y)
## [1] -0.5432319
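As a sanity check, this is exactly what the formula above computes; a quick sketch using the x and y vectors we just pulled:

n = length(x)
# Sum of products of standardized x and y values, divided by n - 1
sum(((x - mean(x)) / sd(x)) * ((y - mean(y)) / sd(y))) / (n - 1) # matches cor(x, y)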
More Correlation Examples
r near 0
\(r\) near 0 indicates a non-horizontal linear relationship does NOT fit the data well. \(r\) is ignorant to any non-linear (curved) relationship in the data, and only focuses on the possibility of a straight line.
r near 1
\(r\) near 1 indicates a positive linear relationship fits the data well. But that doesn’t mean that a linear relationship is necessarily the underlying truth! This data is clearly being generated by a non-linear process, but \(r\) is ignorant to that.
r near -1
\(r\) near -1 implies that a negative linear relationship fits the data well… but \(r\) is ignorant to any non-linear relationship, which is clearly present in this example.
r near 0.7
\(r\) near 0.7 indicates a positive linear relationship fits the data somewhat well, but a curved relationship may actually be the underlying truth.
Correlation Takeaways
\(r\) is a measure of the strength of a (non-horizontal) linear relationship between two continuous variables.
- Negative values indicate a downward sloping relationship, and positive values indicate an upward sloping relationship.
A high or low \(r\) alone does NOT guarantee there really is an underlying non-horizontal linear relationship.
Graph your data!
Introduction to Regression
- The correlation coefficient is just the tip of the iceberg in assessing a linear relationship between two continuous variables.
Linear regression is the branch of statistical inference which estimates and assesses the strength of a true, underlying linear relationship between a response variable and one or more predictors.
Alternative terminology for the response variable includes the dependent variable or the outcome, or often just \(Y\).
Alternative terminology for the predictors include the independent variable(s), the covariate(s), or often just \(X\).
- In this class, we will only investigate the simplest case of linear regression: one continuous response and one continuous predictor.
- However, linear regression is an extremely deep branch of statistics, and regression is not limited to linear regression either.
- Just like all statistical inference, we will be creating confidence intervals for and running hypothesis tests on a population parameter of interest, but that parameter will not be as simple as a proportion or a mean.
Motivation
In the above section on the correlation coefficient, we stressed that it seeks a non-horizontal linear relationship. Why is a horizontal linear relationship unhelpful?
Let’s circle back to one of the motivating questions for this lecture series: “Does knowing X allow us to make a better prediction about the value of Y?”
A horizontal linear relationship corresponds to X giving you no additional information about Y.
For example, consider trying to predict a baseball team’s winning percentage with something completely unhelpful: the latitude of its home stadium.
set.seed(20240406)
made_up = tibble(
win_perc = runif(25, 0, 1),
latitude = runif(25, 25, 48)
)
ggplot(made_up, aes(latitude, win_perc)) +
geom_point() +
labs(title = "Winning Percentage vs. Latitude of Home Stadium",
y = "Winning Percentage",
x = "Latitude (degrees)",
caption = "Disclaimer: Not real data.")I tell you that I have another new baseball team whose latitude is 30 degrees. What is your best prediction for the winning percentage of this new team?
Since the latitude doesn’t tell you anything, you would be forced to just predict the average winning percentage; approximately 50%.
What would your prediction be for a new team at latitude of 40 degrees? 50% again; same for any other latitude value. When X tells you nothing, your prediction for Y is always just the average Y value.
ggplot(made_up, aes(latitude, win_perc)) +
geom_point() +
geom_smooth(method = "lm", se = F) +
labs(title = "Winning Percentage vs. Latitude of Home Stadium",
subtitle = "With Prediction Line; Basically horizontal means X is useless!",
y = "Winning Percentage",
x = "Latitude (degrees)",
caption = "Disclaimer: Not real data.")This is why a horizontal relationship is unhelpful; it means that knowing X did not allow us to make a better prediction than not knowing X.
Now let’s switch up the circumstances. Consider predicting the winning percentage of a baseball team by its total wages.
If a team spent average wages, you would still predict the average winning percentage, 50%. But if a team had higher wages? You would predict a higher winning percentage! In this situation, knowing X allowed us to make a better prediction than if we didn’t know X.
set.seed(20240406)
made_up_again = tibble(
wages = rnorm(25, 5, 2),
win_perc = wages * rnorm(25, 1, 0.5),
win_perc_adj = (win_perc - min(win_perc))/max(win_perc - min(win_perc))
)
ggplot(made_up_again, aes(wages, win_perc_adj)) +
geom_point() +
geom_smooth(se = F, method = "lm", alpha = 0.5) +
labs(title = "Winning Percentage vs. Wages",
y = "Winning Percentage",
x = "Wages (Hundreds of millions of $,\nI don't really know, this isn't real data anyway)")- Thus, non-horizontal linear relationships show us that one variable helps us to explain how another changes. (This example showed a positive relationship, but a negative one is useful too! For example, cigarette usage versus life expectancy.)
A common theme of this course is to turn broad, plain-language questions into those about statistical parameters. How can we turn “is there a relationship between these two variables” into a mathematical question?
With the above example in mind, this question can be re-written as: “Is the slope of the underlying linear relationship equal to zero, or is it something else?”
We assume that the two variables have some true underlying linear relationship, in the form \(Y = mX + b\). We are interested in the value of \(m\), the slope, and particularly whether or not it is 0.
We will not write \(Y = mX + b\); that is just the form most people are familiar with. We instead will write \(Y = \beta_0 + \beta_1*X\), such that \(\beta_1\) is the slope of interest.
Note: \(\beta_0\) and \(\beta_1\) are often called “coefficients” because of how linear regression is structured in linear algebra/matrix form. This language will come up repeatedly in how R names these values and associated commands.
The Least-Squares Regression Line
We make the assumption that the two variables have an underlying linear relationship. This assumption is probably not technically true for any given set of data, but it is very useful if you are willing to accept the assumption as close enough to reality.
We can’t know the true values of \(\beta_0\) and \(\beta_1\); we can only hope to estimate them. However, there is not an immediately obvious point estimate from our sample, like estimating the population mean \(\mu\) with the sample mean \(\bar{X}\).
Still, we need to estimate them. We seek to produce estimates of \(\beta_0\) and \(\beta_1\) which give the best predictions of \(Y\) for the given values of \(X\) when we plug them into the regression equation.
- Call these estimates of the intercept and slope respectively \(\hat{\beta_0}\) and \(\hat{\beta_1}\).
A point we have been glossing over here is what we mean by the best prediction. If we can say that one set of predictions is “better” than another, there must be some metric that we are comparing them by.
Consider that we predict a vector of real \(Y\) values with a set of predictions, \(\hat{Y}\). Values of \(\hat{Y}\) which are closer to the real \(Y\) are better.
Therefore, we should seek a set of predictions (e.g. values of \(\hat{\beta_0}\) and \(\hat{\beta_1}\)) which gives us the SMALLEST distance between \(Y\) and \(\hat{Y}\).
- The differences between \(Y\) and \(\hat{Y}\) are called the residuals, and they are a very important part of linear regression.
Instead of the absolute difference \(|Y - \hat{Y}|\), however, we will try to minimize the squared difference \((Y - \hat{Y})^2\). This is why it is called the least-squares regression line.
The motivation for this decision comes from centuries of statistical theory beyond the scope of this course; the short answer is that the estimators which minimize the sum of squared residuals are both simple and extremely effective, much more so than if we tried to minimize \(|Y - \hat{Y}|\).
- Once we have our estimates \(\hat{\beta_0}\) and \(\hat{\beta_1}\), our predictions of \(Y\) will be:
\[ \hat{Y} = \hat{\beta_0} + \hat{\beta_1} * X \]
\(\hat{Y}\), \(Y\), and \(X\) are all vectors of length n. Every data point has a real \(X\) value, a real \(Y\) value, and a predicted response value \(\hat{Y}\).
We seek to minimize:
\[ \text{Sum of squared residuals} = \sum_{i = 1}^n (Y_i - \hat{Y_i})^2 \]
The algebra/calculus to derive the estimates which minimize the sum of squared residuals is beyond the scope of this course. However, you are expected to know and use the final results.
The slope and intercept estimates which produce the best predictions (minimize the sum of squared residuals) are:
\[ \hat{\beta}_1 = r * \frac{s_y}{s_x} \]
\[ \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 * \bar{x} \]
Dissecting the Estimate Formulas
The estimate of the slope, \(\hat{\beta}_1 = r \times \frac{s_y}{s_x}\), is based on the correlation between \(X\) and \(Y\), and the standard deviations of each.
\(r\) captures the strength of the linear relationship and its direction.
\(s_y\) in the numerator captures the fact that if \(Y\) varies over a large range (e.g. data is “tall”), we will need a higher slope for our regression line to fit the data well.
\(s_x\) in the denominator captures the fact that if \(X\) varies over a large range (e.g. data is “wide”, but not the “wide” like in pivoting, just in the scatter plot visually) then we will need a lower slope for our regression line to fit the data well.
The estimate of the intercept, \(\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}\), is based on the estimate of the slope, and there is good reason for this.
The “best” intercept is the one which calibrates the line to go through the center of the data; namely, the ordered pair \((\bar{x}, \bar{y})\).
The construction of \(\hat{\beta}_0\) ensures this always happens:
\[ \hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 * X \]
\[ \hat{Y} = \bar{y} - \hat{\beta}_1 * \bar{x} + \hat{\beta}_1 * X \]
\[ \hat{Y} \big|_{X = \bar{x}} = \bar{y} - \hat{\beta}_1 * \bar{x} + \hat{\beta}_1 * \bar{x} \]
\[ \hat{Y} \big|_{X = \bar{x}} = \bar{y} \]
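We can verify this numerically with the Lake Monona vectors pulled earlier; a quick sketch (b1 and b0 are just illustrative names for the estimates):

b1 = cor(x, y) * sd(y) / sd(x)
b0 = mean(y) - b1 * mean(x)
(b0 + b1 * mean(x)) - mean(y) # essentially 0 (up to rounding): the line passes through (x-bar, y-bar)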
Example Manual Calculation
- The following graph shows a plot of height in inches versus age in months of a boy from age 2 years to 8 years.
ggplot(df, aes(x = age, y = height)) +
geom_point() +
geom_smooth(se = FALSE, method = "lm") +
ylab("Height (inches)") +
xlab("Age (months)") +
  ylim(c(30, 55)) + scale_x_continuous(breaks = c(0, 40, 60), limits = c(0, 100))

The slope looks to be about 1/4; when \(X\) goes up by 20 (40 to 60), \(Y\) goes up by about 5 (40 to 45).
Also, extrapolating (following a trend beyond its existing range) the prediction line to the left, towards the y-axis, it looks like the intercept (the predicted value of \(Y\) when \(X\) is 0) should be about 30.
Along the way in our calculation, we note that a positive linear relationship fits this data extremely well, so we expect a high correlation coefficient - not 1, since it isn’t perfect, but darn close.
y = df %>% pull(height) # Equivalent to df$height
x = df %>% pull(age) # Equivalent to df$age
r = cor(x, y)
r
## [1] 0.9990835
beta_hat_1 = r * sd(y) / sd(x)
beta_hat_0 = mean(y) - beta_hat_1 * mean(x)
c(beta_hat_1, beta_hat_0)
## [1] 0.2505185 30.2493290
- Therefore, for a given age, the best prediction of height is \(\hat{Y} \approx 30 + 0.25 *\) age.
Example Calculation in R, with lm()
Like t.test, R has a command to calculate these estimates for you; this command is lm(), standing for “linear model”. lm() is an extremely versatile command designed to handle many of the most common cases of linear regression. We will only scratch the surface of the things you can do with it.
The first and only mandatory argument to lm() is a formula of the form y ~ x, where y is the response variable and x is the predictor variable.
Make sure you have this order correct! Response variable ~ predictor variable!
# If you already have the vectors "pull"ed out of the dataframe, you can just plug those in as they are!
lm(y ~ x)
##
## Call:
## lm(formula = y ~ x)
##
## Coefficients:
## (Intercept) x
## 30.2493 0.2505
# But lm() also gives you the convenient "data" argument, where you can pass in a dataframe, and provide it the names of columns so you don't have to extract them.
lm(height ~ age, data = df)
##
## Call:
## lm(formula = height ~ age, data = df)
##
## Coefficients:
## (Intercept) age
## 30.2493 0.2505
Extensions of lm(); coef() and summary()
lm() is doing way more in the background than just printing the coefficient estimates; but that is all it does if you don’t save the results.
model_object, the stored results of lm(), contains many important values, including the coefficients.
To see more information, we can pass model_object into summary().
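The summary output below assumes the fit was stored first; a minimal sketch:

model_object = lm(height ~ age, data = df)
summary(model_object)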
##
## Call:
## lm(formula = height ~ age, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.29911 -0.19291 -0.03355 0.21982 0.46334
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 30.249329 0.186745 161.98 <2e-16 ***
## age 0.250519 0.002869 87.33 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2429 on 14 degrees of freedom
## Multiple R-squared: 0.9982, Adjusted R-squared: 0.998
## F-statistic: 7627 on 1 and 14 DF, p-value: < 2.2e-16
There is almost an overwhelming amount of information here, but the “coefficients” table in the middle is a gold mine of wonderful information.
There are point estimates for both the intercept and slope, with associated standard errors, and information about a hypothesis test that we’ll circle back to.
- To grab just the two coefficient estimates from all this information, we can run the model object through the coef() command.
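The output below comes from a call like:

coef(model_object)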
## (Intercept) age
## 30.2493290 0.2505185
- This gives us a vector that we can index into with [#] to extract the individual values.
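A sketch of that indexing, producing the two outputs below:

coef(model_object)[1]
coef(model_object)[2]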
## (Intercept)
## 30.24933
## age
## 0.2505185
- Once again, we arrive at the predictive model: for a given age, the best prediction of height is \(\hat{Y} \approx 30 + 0.25 *\) age.
Making Predictions
- Now that we have this predictive model (which we know to be the “best” linear model), we can answer questions like:
Predict the boy’s height at age = 100 months (8 years, 4 months).
- From above, with \(\hat{\beta}_0 \approx 30\) and \(\hat{\beta}_1 \approx 0.25\), the prediction is \(\hat{y} = 30 + 0.25 * 100 \approx 55\).
new_x = 100
y_hat = beta_hat_0 + beta_hat_1 * new_x
y_hat # Note: the "name" of the number sometimes persists when you use estimates from coef(). You can ignore the name if it carries through.
## (Intercept)
## 55.30118
- Graphically, if you continue the regression line to \(x = 100\), its height will be \(y \approx 55\).
## Visualization of this prediction
ggplot(df, aes(x = age, y = height)) +
geom_point() +
# fullrange = T extends the blue line past the X range of the existing points
geom_smooth(se = FALSE, method = "lm", fullrange = T) +
geom_point(aes(x=new_x, y=beta_hat_0 + new_x*beta_hat_1), color="red", size=4) +
geom_vline(xintercept = new_x, color="red", linetype="dashed") +
geom_hline(yintercept = y_hat, color = "red", linetype = "dashed") +
ylab("Height (inches)") +
xlab("Age (months)")Through Standardized Units
There is also a standardized way to consider these predictions.
To arrive at this equation, we plug in our estimates:
\[ \hat{\beta}_1 = r * \frac{s_y}{s_x} \]
\[ \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 * \bar{x} \]
…into the predictive equation:
\[ \hat{Y} = \hat{\beta_0} + \hat{\beta_1} * X \]
…which gives:
\[ \hat{Y} = \bar{y} - r * \frac{s_y}{s_x} * \bar{x} + r * \frac{s_y}{s_x} * X \]
- We can algebraically rearrange this to get:
\[ \hat{Y} = \bar{y} + r * s_y * \frac{X - \bar{X}}{s_x} \]
If \(X\) is \(z\) standard deviations (\(s_x\)) away from its mean, then \(\hat{Y}\) will be \(r*z\) standard deviations (\(s_y\)) away from its mean (\(\bar{Y}\)).
- More extreme \(X\) values will lead to more extreme \(Y\) values in the direction (and magnitude) of the correlation coefficient.
or, equivalently,
\[ \frac{\hat{Y} - \bar{y}}{s_y} = r * \frac{X - \bar{X}}{s_x} \]
The standardized score of \(\hat{Y}\) is \(r\) times the standardized score of \(X\), which is \(\frac{x - \bar{x}}{s_x}\).
The value \(x = 100\) is \(z = (100 - \bar{x})/s_x\) standard deviations from its mean.
Therefore, \(\hat{y}\) for \(x = 100\) is \(r*z\) standard deviations away from its mean.
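A sketch of this standardized route in R, using the x, y, and r from the height/age example; it reproduces the earlier prediction in the output below:

z = (100 - mean(x)) / sd(x) # x = 100 in standardized units
mean(y) + r * z * sd(y)     # y-hat is r * z standard deviations of y above the mean of y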
## [1] 55.30118
Back to the Lake Monona Data
- Let’s try this with the Lake Monona data
# Manual calculation of slope and intercept
x = monona$year1
y = monona$duration
r = cor(x, y)
beta_hat_1 = r * sd(y) / sd(x)
beta_hat_0 = mean(y) - beta_hat_1 * mean(x)
beta_hat_0
## [1] 535.1252
beta_hat_1
## [1] -0.2229849
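lm() gives the same estimates. The printout below assumes the fit is stored as model, the name these notes use later on:

model = lm(duration ~ year1, data = monona)
model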
##
## Call:
## lm(formula = duration ~ year1, data = monona)
##
## Coefficients:
## (Intercept) year1
## 535.125 -0.223
- Predict the duration Lake Monona will be closed in the 2026-2027 winter.
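The 2026-2027 winter corresponds to year1 = 2026. A sketch of two equivalent ways to produce the matching predictions below, one manual and one from the stored lm() fit:

beta_hat_0 + beta_hat_1 * 2026             # manual plug-in to the fitted equation
coef(model)[[1]] + coef(model)[[2]] * 2026 # same calculation via coef(); [[ ]] drops the name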
## [1] 83.35771
## [1] 83.35771
Sidenote: Extrapolation
- The calculation we did above has an implicit assumption; that our linear model is valid beyond the range of the existing data.
Using a model to make predictions for new X values beyond the range of the existing data is called extrapolation.
Extrapolation requires you to assume the modeled relationship is valid beyond the X values observed in your data. Whether this is valid depends on the circumstances of your data and how far away the new X value is.
- It is easy to blindly trust the “83 days” prediction above, but what if we tried to predict the number of days of closure in the 2400-2401 winter?
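A sketch of that calculation, plugging year1 = 2400 into the same fitted equation:

beta_hat_0 + beta_hat_1 * 2400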
## [1] -0.03865795
Obviously, the lake cannot be closed for a negative number of days, and this prediction is not valid.
But where is the line between the 2026-2027 prediction being valid and the 2400-2401 prediction being invalid?
There is no mathematical “line” when predictions stop being valid - it is all about whether you are willing to accept the linear relationship as being close enough to reality to be useful, given the circumstances of the data and just how far away the new X point is.
The Model and Assumptions
In order to study our parameter of interest, \(\beta_1\), we need to give it an exact definition - what is the thing we are studying, how does it relate to our observed data or the process that generates it?
That statement is called a model. For example, \(X \sim Binom(10, p)\) is a model, which relates our observed data (\(X\), the number of successes) to our unknown parameter of interest, \(p\).
- A model should always have your unknown parameter of interest in it.
Models carry with them certain assumptions that need to be checked.
- The model above, \(X \sim Binom(10, p)\), makes the four BINS assumptions which need to be checked.
Our data in regression is a vector of response data \(Y\), and a vector of predictor data \(X\). Both of them have to have the same length, \(n\), because the values in the \(i\)th point (\(X_i, Y_i\)) are connected; every \(X\) has to have a \(Y\).
In plain English, we might say that every value \(Y_i\) comes from the same linear function of its corresponding \(X_i\), i.e. \(Y_i = \beta_0 + \beta_1 * X_i\) for \(i = 1,...,n\).
Accounting for Variability
This is close to our final model… but it misses a fundamental point about the linear relationships we are trying to study.
Clearly, real data does not fall exactly on a straight line. If it did, this whole thing would be really easy.
We have always talked about the underlying linear relationship, around which there can be some variability.
We do NOT assert that our real data will look like this:
example = tibble(
x = runif(10, 0, 10),
y = 2 + 3*x
)
ggplot(example, aes(x, y)) +
geom_point(size = 3) +
  geom_smooth(se = F, method = "lm")

- We instead assume that this line is where the \(Y\) values will end up on average, but in reality, we observe some normally distributed error around this true line.
set.seed(20230407)
ggplot(example, aes(x + rnorm(10, sd = 3), y)) +
geom_point(size = 3) +
  geom_smooth(se = F, method = "lm")

You can think of this as: at each point of \(X\), the values of \(Y\) are generated from a normal distribution centered around the height of the true regression line at that \(X\).
Here’s a graph which overlays how likely the \(Y\) value is to be at each location for \(X = 2\) and \(X = 6\): a normal distribution centered around the regression line.
- Important points to take away here are:
The true line dictates where the Y values will be on average, but there is random variation around that true line.
The standard deviation/spread of the normal distribution around the true line does not change as you move along the X axis; i.e. its standard deviation is constant, independent of X.
This is often referred to as the “constant variance” assumption, since variance was more useful to the people who named it, but constant variance/constant standard deviation are the same thing.
With this image in mind, a natural first guess is \(Y_i \sim N(\beta_0 + \beta_1 * X_i, \sigma)\), which generates each \(Y_i\) according to a normal distribution centered at the true regression line, as we see above. \(\sigma\) is the true, unknown standard deviation of the errors around the true line.
However, to change this into a more convenient form, we will “un-shift” the normal distribution by taking the mean out.
The Model
- We model linear regression as:
\[ Y_i = \beta_0 + \beta_1 * X_i + \varepsilon_i \text{, for } i = 1,...,n \] \[ \text{where } \varepsilon_i \sim N(0, \sigma) \]
We recognize \(\beta_0 + \beta_1*X\) as our standard slope-intercept form, but we now have this new term, \(\varepsilon_i \sim N(0, \sigma)\), pronounced “epsilon-i”.
This new term is called the error term. It is a true, unknown parameter for each point, which captures how far off the true line each point is.
We assume it is normally distributed with an expected value of 0 (i.e., points vary equally above and below the line) and a standard deviation of \(\sigma\), another true, unknown parameter which controls how wide the error variation is around the true line.
Sidenote: The Signal and the Noise
This is not an official thing that we would ever test you on, but it is one of my favorite statistical concepts, and it might be helpful in understanding inference, so I thought I would share it!
This error term is also sometimes referred to as the random noise.
The reason for calling it “noise” goes back to an old metaphor about statistical inference.
Consider trying to talk to someone using a walkie-talkie or radio.
The thing you really care about is their voice, what they are saying - the signal. This is analogous to the real statistical relationship, the real parameters you are interested in.
However, there is often an obstacle to extracting the signal - this walkie-talkie or radio has some static feedback/noise also being played at the same time (the error/variation around the true regression line).
We would love for the noise to be much quieter than the signal. But as the noise gets louder and louder (more variation) it becomes harder to extract the voice (the true line).
Nate Silver, a prominent political data scientist and election predictor, has an excellent book on prediction called The Signal and The Noise that I highly recommend!
The Assumptions
- There are three assumptions implicit in this model. Unfortunately, we don’t have a catchy “BINS”-like acronym this time around.
- The relationship between X and Y is actually linear, as opposed to a curved/non-linear relationship.
- The errors are normally distributed around 0.
- The errors have constant variance/standard deviation, which does not change with X.
Checking Assumptions with Residual Plots
All three of the above assumptions can be checked with one graph, called a residual plot.
Recall that a residual is the difference between the real \(Y_i\) and the predicted \(\hat{Y}_i\) for the \(i\)th point in your data.
A residual is the observed, sample-based estimate of the true, unknown error \(\varepsilon_i\).
A residual plot graphs your \(X\) variable on the horizontal like usual, but instead of \(Y\) on the vertical axis, it plots the residuals, \(Y_i - \hat{Y}_i\).
Obtaining the Residuals
- How do we obtain the residuals and add them to our dataframe for plotting? Through the object returned by lm()!
Just like we can extract the “coefficient” estimates from the model object with coef(), we can extract the residuals with resid().
Like coef(), this returns a “named vector”, where we can ignore the names.
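The long vector below comes from calling resid() on the stored Lake Monona fit (model = lm(duration ~ year1, data = monona), as before):

resid(model)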
## 1 2 3 4 5 6
## -3.48813087 29.73485406 -2.04216101 -26.81917607 -9.59619114 8.62679379
## 7 8 9 10 11 12
## 12.84977873 -0.92723634 11.29574860 -2.48126647 5.74171846 8.96470340
## 13 14 15 16 17 18
## -10.81231167 7.41067326 19.63365820 -18.14335687 24.07962806 23.30261300
## 19 20 21 22 23 24
## 18.52559793 5.74858287 -26.02843220 15.19455273 -54.58246233 -1.35947740
## 25 26 27 28 29 30
## -23.13649247 44.08649247 -38.69052260 11.53246233 4.75544727 5.97843220
## 31 32 33 34 35 36
## 16.20141713 16.42440207 22.64738700 -27.12962806 -39.90664313 -0.68365820
## 37 38 39 40 41 42
## -17.46067326 3.76231167 -14.01470340 -8.79171846 6.43126647 10.65425140
## 43 44 45 46 47 48
## -11.12276366 17.10022127 0.32320621 6.54619114 -8.23082393 -26.00783899
## 49 50 51 52 53 54
## 29.21514594 0.43813087 4.66111581 -4.11589926 -3.89291433 8.33007061
## 55 56 57 58 59 60
## -13.44694446 -6.22395952 3.99902541 -4.77798966 -25.55500472 6.66798021
## 61 62 63 64 65 66
## 2.89096514 8.11395008 8.33693501 -24.44008006 8.78290488 -21.99411019
## 67 68 69 70 71 72
## -10.77112525 20.45185968 -3.32515539 -3.10217045 20.12081448 -3.65620059
## 73 74 75 76 77 78
## 2.56678435 -9.21023072 -0.98724579 -6.76426085 -44.54127592 2.68170901
## 79 80 81 82 83 84
## -1.09530605 -5.87232112 3.35066382 19.57364875 7.79663368 -0.98038138
## 85 86 87 88 89 90
## -0.75739645 24.46558848 -17.31142658 11.91155835 -0.86545672 -12.64247178
## 91 92 93 94 95 96
## -2.41948685 6.80349809 7.02648302 -8.75053205 12.47245289 17.69543782
## 97 98 99 100 101 102
## 11.91842275 -6.85859231 -27.63560738 -16.41262245 9.81036249 -6.96665258
## 103 104 105 106 107 108
## 10.25633236 26.47931729 2.70230222 -0.07471284 15.14827209 14.37125702
## 109 110 111 112 113 114
## -5.40575804 16.81722689 -31.95978818 -8.73680324 -9.51381831 4.70916663
## 115 116 117 118 119 120
## -4.06784844 -2.84486351 -0.62187857 -5.39889364 -1.17590871 0.04707623
## 121 122 123 124 125 126
## -4.72993884 8.49304609 26.71603103 22.93901596 10.16200090 -7.61501417
## 127 128 129 130 131 132
## 6.60797076 -37.16904430 14.05394063 -0.72307444 20.49991050 -7.27710457
## 133 134 135 136 137 138
## -1.05411964 1.16886530 6.39185023 -3.38516484 -0.16217990 17.06080503
## 139 140 141 142 143 144
## 4.28378997 -16.49322510 16.72975983 5.95274477 -40.82427030 -19.60128537
## 145 146 147 148 149 150
## -37.37830043 18.84468450 -32.93233057 -5.70934563 -11.48636070 8.73662424
## 151 152 153 154 155 156
## 5.95960917 -20.81740590 34.40557904 16.62856397 0.85154890 28.07453384
## 157 158 159 160 161 162
## -17.70248123 14.52050370 34.74348864 18.96647357 -23.81054149 -4.58755656
## 163 164 165 166 167 168
## 7.63542837 11.85841331 -2.91860176 -1.69561683 -2.47263189 6.75035304
## 169
## -40.02666203
- We can add the residuals as a new column in our dataframe called residuals with a simple mutate().
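A sketch that would produce the preview below; the select() is an assumption to match the three columns shown:

monona %>%
  mutate(residuals = resid(model)) %>%
  select(year1, duration, residuals) %>%
  head()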
## # A tibble: 6 × 3
## year1 duration residuals
## <dbl> <dbl> <dbl>
## 1 1855 118 -3.49
## 2 1856 151 29.7
## 3 1857 119 -2.04
## 4 1858 94 -26.8
## 5 1859 111 -9.60
## 6 1860 129 8.63
- We can also augment the predicted values \(\hat{Y}\) onto our data frame with similar commands; the base R command to extract the predictions from a model object is predict(), with the helpful modelr shortcut function add_predictions().
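On its own, predict() returns another named vector; the output below comes from:

predict(model)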
## 1 2 3 4 5 6 7 8
## 121.48813 121.26515 121.04216 120.81918 120.59619 120.37321 120.15022 119.92724
## 9 10 11 12 13 14 15 16
## 119.70425 119.48127 119.25828 119.03530 118.81231 118.58933 118.36634 118.14336
## 17 18 19 20 21 22 23 24
## 117.92037 117.69739 117.47440 117.25142 117.02843 116.80545 116.58246 116.35948
## 25 26 27 28 29 30 31 32
## 116.13649 115.91351 115.69052 115.46754 115.24455 115.02157 114.79858 114.57560
## 33 34 35 36 37 38 39 40
## 114.35261 114.12963 113.90664 113.68366 113.46067 113.23769 113.01470 112.79172
## 41 42 43 44 45 46 47 48
## 112.56873 112.34575 112.12276 111.89978 111.67679 111.45381 111.23082 111.00784
## 49 50 51 52 53 54 55 56
## 110.78485 110.56187 110.33888 110.11590 109.89291 109.66993 109.44694 109.22396
## 57 58 59 60 61 62 63 64
## 109.00097 108.77799 108.55500 108.33202 108.10903 107.88605 107.66306 107.44008
## 65 66 67 68 69 70 71 72
## 107.21710 106.99411 106.77113 106.54814 106.32516 106.10217 105.87919 105.65620
## 73 74 75 76 77 78 79 80
## 105.43322 105.21023 104.98725 104.76426 104.54128 104.31829 104.09531 103.87232
## 81 82 83 84 85 86 87 88
## 103.64934 103.42635 103.20337 102.98038 102.75740 102.53441 102.31143 102.08844
## 89 90 91 92 93 94 95 96
## 101.86546 101.64247 101.41949 101.19650 100.97352 100.75053 100.52755 100.30456
## 97 98 99 100 101 102 103 104
## 100.08158 99.85859 99.63561 99.41262 99.18964 98.96665 98.74367 98.52068
## 105 106 107 108 109 110 111 112
## 98.29770 98.07471 97.85173 97.62874 97.40576 97.18277 96.95979 96.73680
## 113 114 115 116 117 118 119 120
## 96.51382 96.29083 96.06785 95.84486 95.62188 95.39889 95.17591 94.95292
## 121 122 123 124 125 126 127 128
## 94.72994 94.50695 94.28397 94.06098 93.83800 93.61501 93.39203 93.16904
## 129 130 131 132 133 134 135 136
## 92.94606 92.72307 92.50009 92.27710 92.05412 91.83113 91.60815 91.38516
## 137 138 139 140 141 142 143 144
## 91.16218 90.93919 90.71621 90.49323 90.27024 90.04726 89.82427 89.60129
## 145 146 147 148 149 150 151 152
## 89.37830 89.15532 88.93233 88.70935 88.48636 88.26338 88.04039 87.81741
## 153 154 155 156 157 158 159 160
## 87.59442 87.37144 87.14845 86.92547 86.70248 86.47950 86.25651 86.03353
## 161 162 163 164 165 166 167 168
## 85.81054 85.58756 85.36457 85.14159 84.91860 84.69562 84.47263 84.24965
## 169
## 84.02666
monona %>%
mutate(
residuals = resid(model),
predictions = predict(model)) %>%
select(year1, duration, predictions, residuals) %>%
  head()
## # A tibble: 6 × 4
## year1 duration predictions residuals
## <dbl> <dbl> <dbl> <dbl>
## 1 1855 118 121. -3.49
## 2 1856 151 121. 29.7
## 3 1857 119 121. -2.04
## 4 1858 94 121. -26.8
## 5 1859 111 121. -9.60
## 6 1860 129 120. 8.63
Creating the Plot
A residual plot keeps our original, unedited \(X\) variable on the horizontal axis, and puts the residuals on the \(Y\) axis.
- It will be helpful in evaluating the assumptions of linear regression (the whole point of making a residual plot) to add a horizontal line at y = 0.
# As a reminder if you've skipped ahead to this section, model = lm(duration ~ year1, data = monona)
monona %>%
mutate(residuals = resid(model)) %>%
# Relatively straightforward scatter plot
ggplot(aes(x = year1, y = residuals)) +
geom_point() +
  geom_hline(yintercept = 0)

Like the binomial BINS assumptions, evaluating linear regression assumptions is a very subjective endeavor, often with unclear results; especially on real data.
- For this reason, we will first look at some small examples where the violations are exaggerated to be obvious, before returning to the real plot above, where violations may not be obvious.
Assessing A1: Linearity
If the residuals show an obvious non-linear pattern, the linearity assumption is violated.
- Here is an example residual plot where linearity is violated:
- Recall that geom_smooth() by default will curve if that is the best fit to the data. We often use this to our advantage when assessing linearity; if you can’t tell if there’s a curve pattern, add a geom_smooth, but don’t constrain it to be straight! If it curves significantly, then linearity is violated.
- Here is an example where the linearity assumption is reasonable, because we see **no obvious non-linear pattern.**
- The geom_smooth() line generally follows the center line. Some variation is expected due to randomness; that is totally okay.
Assessing A2: Normal Errors Around 0
Note: In practice, we usually assess normality with a different method called a quantile-quantile (“q-q”) plot, which is better at detecting smaller deviations from normality. However, it is still visible with a residual plot.
- This assumption is a two-parter, but the two parts tend to be violated or satisfied together, so I have combined them into one.
If the points don’t tend to be near the line on average, then the normal errors around 0 assumption is violated.
If the points aren’t evenly spread in both directions around 0, then the normal errors around 0 assumption is violated.
- Here is an example where normal errors are violated. The points don’t tend to cluster towards the line, most of the points are away from the line. (This suggests that 0 is not the most likely value, as it would be in a normal distribution of errors.)
If they were normally distributed, then outliers (away from the central line) would be uncommon.
Another way this assumption can go wrong is if the residuals are significantly asymmetric; skewed in one direction or the other.
For example, the below residual plot shows that most of the residuals are negative. (These residuals still average to 0 because of the larger positive ones, but it is not symmetric.)
set.seed(20240407)
nonnormal_data2 = tibble(
x = runif(100, 0, 10),
y = 2*x + rgamma(100, 1) - 1
)
createResidualPlot(nonnormal_data2) +
ggtitle("Normality Around 0 Violated",
subtitle = "Spread is not Symmetric")- Our familiar “satisfactory” residual plot satisfies this normal errors around 0 assumption, generally, points tend to stay close to the line, with outliers uncommon, AND the spread is even in both directions.
Assessing A3: Constant Variance/SD
If the spread of the residuals around the center line increases or decreases as you move horizontally across the plot (e.g. the points “fan out” or “funnel in”), then constant variance is violated.
We won’t test you on this, but the fancy term for constant variance is “homoskedasticity”; the fancy term for a violation of this (e.g. non-constant variance) is “heteroskedasticity”.
When we write \(\varepsilon_i \sim N(0, \sigma)\) in our model, this assumption reflects the fact that \(\sigma\) is just a single number, not dependent on \(X\) at all; a violation would look like, for example, \(N(0, \sigma + X_i)\) or \(N(0, \sigma * X_i)\).
In the below example, the variance of the errors around the center line increases with \(X\), indicating a violation of constant variance.
- And here’s the other common way the assumption can be violated; the variance decreases with \(X\), or the points “funnel in”:
- Finally, our familiar satisfactory plot fits the bill again; the spread does not change as you move horizontally.
createResidualPlot(linear_data) +
ggtitle("Constant Variance Satisfied", subtitle = "Points show no obvious spreading out or funnelling in from left to right")One could argue there is a slight “tightening” of the data to the right side, but it is not obvious… even in these fake examples, unclear results are inevitable!
Real Example: Lake Monona Data
- Let’s return to the residual plot from the real Lake Monona data and try to assess all three assumptions.
# See: "Creating the Plot" section
monona %>%
mutate(residuals = resid(model)) %>%
ggplot(aes(x = year1, y = residuals)) +
geom_point() +
  geom_hline(yintercept = 0)

Linearity is satisfied; the residuals don’t show an obvious curve pattern.
Normal errors around 0 is satisfied; the residuals tend towards the central line, with large outliers uncommon, and they are symmetric.
Constant variance is satisfied; the residuals do NOT fan out or funnel in as you increase X (year).